Abstract

Collaborative filtering (CF) aims to predict users' ratings on items
according to historical user-item preference data. In many real-world
applications, preference data are usually sparse, which makes models
overfit and fail to give accurate predictions. Recently, several
research works have shown that by transferring knowledge from some
manually selected source domains, the data sparseness problem can be
mitigated. However, in most cases, parts of the source domain data are
not consistent with the observations in the target domain, which may
misguide the target domain model building. In this paper, we propose a
novel criterion based on empirical prediction error and its variance to
better capture the consistency across domains in CF settings. We then
embed this criterion into a boosting framework to perform selective
knowledge transfer. Compared with several state-of-the-art methods, we
show that our proposed selective transfer learning framework can
significantly improve the accuracy of rating prediction on several
real-world recommendation tasks.

Keywords: Transfer Learning; Collaborative Filtering; Cross Domain
Recommendation

1 Introduction 

Recommendation systems attempt to recommend items (e.g.,
movies, TV, books, news, images, web pages, etc.) that are
likely to be of interest to users. As a state-of-the-art technique
for recommendation systems, collaborative filtering
aims at predicting users' ratings on a set of items based on
a collection of historical user-item preference records. In
real-world recommendation systems, although the item
space is often very large, users usually rate only a small number
of items. Thus, the available rating data can be extremely
sparse for each user, which is especially true for new online
services. This data sparsity problem may make CF models
overfit the limited observations and result in low-quality
predictions.

In recent years, different transfer learning techniques
have been developed to improve the performance of learning
a model by reusing information from other relevant
systems for collaborative filtering [11], [27]. With the
increasing understanding of auxiliary data sources, some
works (like fS", [22]) have started to explore data from multiple
source domains to achieve more comprehensive knowledge
transfer. However, these previous methods over-trust the
source data and assume that the source domains follow
distributions very similar to that of the target domain, which is
usually not true in real-world applications, especially
under cross-domain CF settings. For example, on a local
music rating web site, natives may give trustworthy ratings for
traditional music, while on an international music rating
web site, the ratings on that traditional music could be
diverse due to cultural differences: users with a good
cultural background would consistently give trustworthy ratings,
while others could be inaccurate. If the target domain task is
music recommendation for a startup local web site, we clearly
do not want to use all of the international web site's data as a source
domain without selection. To better tackle cross-domain
CF problems, we face the challenge of telling how consistent the
data of the target and source domains are, and of adopting only the
consistent source domain data while transferring knowledge.

Several research works (like [2]) have been proposed to
perform instance selection across domains for classification
tasks based on empirical error. But they cannot be adopted
to solve CF problems directly. In particular, when the target
domain is sparse, the observations of users' ratings on the
items in the target domain are limited, so occasionally getting a
low empirical error in the target domain does
not mean the source domains are truly helpful in building
a good model. In other words, inconsistent knowledge
from source domains may dominate the target domain model
building and happen to fit the few observations in the target
domain, which yields a misleadingly high accuracy.

We analyze this problem carefully. In our
observation of the music rating example, some users, such
as domain experts, follow standard criteria to rate and hence
share a consistent distribution over the mutual item set across
domains. Further, we find that this consistency can be better
described by adding the variance of the empirical error produced
by the model. The smaller the variance of the empirical error on
predictions for a user, the more likely this user is consistent
with those from other domains. We would like to adopt
those who are more likely to share consistent preferences to
perform knowledge transfer across domains. Based on this
observation, we propose a new criterion using both empirical
error and its variance to capture the consistency between
source and target domains. As an implementation, we
[Figure 1 legend: source data inconsistent with target; source data consistent with target]

Figure 1: Selective transfer learning with multiple sources. The first row
illustrates the case where items that are in the target domain also appear in
the source domains. A real-world example is the rating prediction for
movies that appear on several web sites in various forms. The second row
illustrates the case where users that are in the target domain also appear in
the source domains. A real-world example is the Douban recommendation
system, which provides music, book and movie recommendations for users.

embed this criterion into a boosting framework and propose 
a novel selective transfer learning approach for collaborative 
filtering (STLCF). STLCF works in an iterative way to adjust 
the importance of source instances, where those source data 
with low empirical error as well as low variance will be 
selected to help build target models. 

Our main contributions are summarized as follows: 

• First, we find that selecting consistent auxiliary data
for the target domain is important for cross-domain
collaborative filtering, while the consistency between
source and target domains is influenced by multiple
factors. To describe these factors, we propose a novel
criterion based on both empirical error and its variance.

• Second, we propose a selective transfer learning framework
for collaborative filtering, an extension of the
boosting-based transfer learning algorithm that takes the
above factors into consideration, so that the sparseness
issue in CF problems can be better tackled.

• Third, the proposed framework is general, and different
base models can be embedded in it. We propose an
implementation based on Gaussian probabilistic latent
semantic analysis, which demonstrates that the proposed
framework can solve the sparseness problem in various
real-world applications.

2 Preliminaries 

2.1 Problem Settings Suppose that we have a target task
$\mathcal{D}$ in which we wish to solve the rating prediction problem.
Taking a regular recommendation system for illustration,
$\mathcal{D}$ is associated with $m_d$ users and $n_d$ items, denoted by
$\mathcal{U}^d$ and $\mathcal{V}^d$ respectively. In this task, we observe a sparse matrix
$X^{(d)} \in \mathbb{R}^{m_d \times n_d}$ with entries $x^d_{ui}$. Let
$R^{(d)} = \{(u, i, r) : r = x^d_{ui}, \text{ where } x^d_{ui} \neq 0\}$
denote the set of observed links
in the system. For the rating recommendation system, $r$ can
either take numerical values, for example in [1,5], or binary
values {0, 1}. We aim to transfer knowledge from $N$ other
source domains $\mathcal{S} = \{\mathcal{S}^k\}_{k=1}^{N}$, where each source domain
contains $m_{s^k}$ users and $n_{s^k}$ items, denoted by $\mathcal{U}^{s^k}$ and
$\mathcal{V}^{s^k}$. Similar to the target domain, each source domain
contains a sparse matrix $X^{(s^k)} \in \mathbb{R}^{m_{s^k} \times n_{s^k}}$ and observed
links $R^{(s^k)} = \{(u, i, r) : r = x^{s^k}_{ui}, \text{ where } x^{s^k}_{ui} \neq 0\}$.

The settings of STLCF are illustrated in Figure 1. We
adopt a setting commonly used in transfer learning for
collaborative filtering: either the items or the users that are in
the target domain also appear in the source domains. In the
following derivation and description of our STLCF model,
for convenience of interpretation, we focus on the case
where the user set is shared by the target domain and the
source domains. The case where the item set is shared can
be tackled in a similar manner.

Under the assumption that the observation $R^{(\cdot)}$ is
obtained with $u$ and $i$ being independent, we formally define
a co-occurrence model in both the target and the source
domains to solve the collaborative filtering problem:

$\Pr(x^{(\cdot)}_{ui} = r, u, i) = \Pr(u)\Pr(i \mid u)\Pr(x^{(\cdot)}_{ui} = r \mid u, i)$
$\qquad = \Pr(u)\Pr(i)\Pr(x^{(\cdot)}_{ui} = r \mid u, i)$
$\qquad \propto \Pr(x^{(\cdot)}_{ui} = r \mid u, i)$

In the following, based on Gaussian probabilistic latent
semantic analysis (GPLSA), we first briefly present a
transfer learning model for collaborative filtering, transferred
Gaussian probabilistic latent semantic analysis (TGPLSA),
as an example, which is designed to be integrated into
our later proposed framework as a base model. After that,
we present our selective transfer learning for collaborative
filtering (STLCF), which performs knowledge transfer by analyzing
the inconsistency between the observed data in the target domain
and the source domains. Note that, besides the TGPLSA example,
STLCF is compatible with various generative models as the
base model.

2.2 Collaborative Filtering via Gaussian Probabilistic
Latent Semantic Analysis (GPLSA) For every user-item
pair, we introduce hidden variables $Z$ with latent topics
$z$, so that user $u$ and item $i$ are rendered conditionally
independent. With observations of item set $\mathcal{V}$, user set $\mathcal{U}$
and rating set $R$ in the source domain, we define a model as:

$\Pr(x_{ui} = r \mid u, i) = \sum_z \Pr(x_{ui} = r \mid i, z)\Pr(z \mid u)$

We further investigate the use of a Gaussian model for
estimating $\Pr(x_{ui} = r \mid u, i)$ by introducing $\mu_{iz} \in \mathbb{R}$ for the
mean and $\sigma^2_{iz}$ for the variance of the ratings. With these, we
define a Gaussian mixture model for a single domain as:

$\Pr(x_{ui} = r \mid u, i) = \sum_z \Pr(z \mid u)\Pr(r; \mu_{iz}, \sigma_{iz})$

where $\Pr(z \mid u)$ is the topic distribution over users, and
$\Pr(r; \mu_{iz}, \sigma_{iz})$ follows a Gaussian distribution.

Maximum likelihood estimation amounts to minimizing:

(2.1) $\mathcal{L} = -\sum_{(u,i,r) \in R} \log[\Pr(x_{ui} = r \mid u, i; \theta)]$

where $\theta$ is a generic parameter referring to a particular
model.
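As an illustration of the single-domain mixture and of the objective in Eq.(2.1), the following is a minimal sketch rather than the authors' implementation. The array names `p_z_given_u`, `mu` and `sigma`, their shapes, and the use of NumPy are assumptions made for this example; parameter fitting (for instance by EM) is not shown.

```python
import numpy as np


def gaussian_pdf(r, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) evaluated at r."""
    return np.exp(-0.5 * ((r - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def gplsa_rating_density(r, u, i, p_z_given_u, mu, sigma):
    """Mixture density Pr(x_ui = r | u, i) = sum_z Pr(z | u) * N(r; mu_iz, sigma_iz).

    Assumed (hypothetical) layout:
      p_z_given_u: array of shape (n_users, n_topics), each row sums to 1
      mu, sigma:   arrays of shape (n_items, n_topics)
    """
    return float(np.sum(p_z_given_u[u] * gaussian_pdf(r, mu[i], sigma[i])))


def gplsa_expected_rating(u, i, p_z_given_u, mu):
    """Mean of the mixture, usable as a point prediction for the rating."""
    return float(np.dot(p_z_given_u[u], mu[i]))


def negative_log_likelihood(observed, p_z_given_u, mu, sigma):
    """Objective of Eq.(2.1) summed over observed (u, i, r) triples."""
    return -sum(
        np.log(gplsa_rating_density(r, u, i, p_z_given_u, mu, sigma))
        for (u, i, r) in observed
    )
```

Using the mixture mean as the predicted rating is one common choice for this kind of model; the text itself does not state which point estimate is used.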

Next, we extend GPLSA to the cross domain context to 
achieve transfer learning for collaborative filtering (TLCF). 

2.3 Transfer Learning for Collaborative Filtering
(TLCF) When the target data $X^{(d)}$ is sparse, GPLSA may
overfit the limited observed data. Following an idea similar to
that in [25], we extend GPLSA to the Transferred Gaussian
Probabilistic Latent Semantic Analysis (TGPLSA) model. Again
we use $s$ to denote the index of the source domain where the
knowledge comes from, and $d$ to denote the index of the target
domain where the knowledge is received. For simplicity,
we present the work with one source domain; the model
can be easily extended to multiple source domains. Moreover,
we assume all the users appear in both the source domains
and the target domain. Such scenarios are common in
real-world systems, like Douban (http://www.douban.com, a
widely used online service in China, which provides music, book
and movie recommendations).

Similar to the approach in [25], TGPLSA jointly
learns the two models for the source domain and the
target domain using a relative weight parameter $\lambda \in (0, 1)$,
which is introduced to represent the tradeoff between source and
target information. Since the item sets $\mathcal{V}^s$ and $\mathcal{V}^d$ are
different or even disjoint, there could be inconsistency across
domains, as we discussed in Section 1. Clearly, the more consistent
the source and target domains are, the more help the target task
can get from the source domain(s). We are motivated to
further analyze this inconsistency in our work by learning
item weight vectors $w^s$ and $w^d$ of
the instances in the source and target domains respectively. Then,
the objective function in Eq.(2.1) can be extended as:

(2.2) $\mathcal{L} = -\Big( \lambda \sum_{r \in X^s} \log\big(w^s_i \cdot \Pr(x^s_{ui} = r \mid u^s, i^s; \theta^s)\big) + (1 - \lambda) \sum_{r \in X^d} \log\big(w^d_i \cdot \Pr(x^d_{ui} = r \mid u^d, i^d; \theta^d)\big) \Big)$

We adopt the expectation-maximization (EM) algorithm,
a standard method for statistical inference, to find the
maximum log-likelihood estimates of Eq.(2.2). Details of
derivations can be found in the appendix.

Algorithm 1 Selective TLCF.

Input: $X^d$, $X^s$, $T$
    $X^d \in \mathbb{R}^{m \times n_d}$: the target training data
    $X^s \in \mathbb{R}^{m \times n_s}$: the auxiliary source data
    $G$: the weighted TLCF model wTGPLSA
    $T$: number of boosting iterations
Initialize: the item weights $w^s$ and $w^d$
for iter = 1 to T do
    Step 1: Apply $G$ to generate a weak learner
            $G(X^d, X^s, w^d, w^s)$ that minimizes Eq.(2.2)
    Step 2: Get the weak hypothesis for both the d and s
            domains $h^{iter} : X^d, X^s \rightarrow \hat{X}^d, \hat{X}^s$
    Step 3: Calculate the empirical errors for both the d and s
            domains
    Step 4: Calculate the fitness weight $\beta^{s_k}$ for each source
            domain $s_k$ using Eq.(3.13)
    Step 5: Choose the model weight $\alpha^{iter}$ via Eq.(3.9)
    Step 6: Update the source item weight $w^s$ via Eq.(3.12)
    Step 7: Update the target item weight $w^d$ via Eq.(3.11)
end for
Output: Hypothesis $Z = H(X^{(d)}) = \sum_{t=1}^{T} \alpha^t h^t(X^{(d)})$



3 Selective TLCF 

As we have discussed before, using the source domain data
without selection may harm the target domain learning. In this
section, we present the details of the Selective Transfer Learning
framework for CF, which performs selective knowledge transfer
based on the two novel factors (empirical error and variance of
empirical error). As illustrated in the second
example in Figure 1, where the domains have a mutual user set,
we would like to transfer knowledge from those item records
that consistently reflect the user's preferences. Because
consistent records have small empirical
error and variance, the selection should consider these two factors.
We embed these two factors into a boosting framework,
where the source data with small empirical error and variance
receive higher weights, since they are consistent with
the target data. This boosting framework models
cross-domain CF from two aspects: on one hand, we pay more
attention to the mis-predicted target instances; on the other
hand, we automatically identify the consistency of the source
domains during the learning process and selectively use those
source domains with more trustworthy information.

As shown in Algorithm 1, in each iteration, we apply the
base model TGPLSA over weighted instances to build a
weak learner $G(\cdot)$ and hypothesis $h^t$. Then, to update
the source and target item weights, the domain level fitness
weight $\beta^{s_k}$ is chosen for each source domain $s_k$ based
on domain level consistency. The weight $\alpha^t$ for the base
model is also updated, considering empirical errors and
variances. Accordingly, the weights of mis-predicted target
items are increased and the weights of the less helpful
source domains are decreased in each iteration. The final
ensemble is given by an additive model, which gives larger
weights to the hypotheses with lower errors. We provide a
detailed derivation of STLCF in the rest of this section.

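In previous works on collaborative filtering, the mean
absolute error (MAE) is usually chosen as the loss function.
If we tolerate some prediction error $\tau$, we define:

(3.3) $\ell_1(X^d_{*i}, \hat{X}^d_{*i}) = \begin{cases} -1, & \sum_u |\hat{x}^d_{ui} - x^d_{ui}| \le \tau \cdot \mathrm{nnz}(X^d_{*i}) \\ +1, & \sum_u |\hat{x}^d_{ui} - x^d_{ui}| > \tau \cdot \mathrm{nnz}(X^d_{*i}) \end{cases}$

where $\mathrm{nnz}(\cdot)$ is the number of observed ratings, and $X^d_{*i}$ and
$\hat{X}^d_{*i}$ denote the true values and predictions respectively. We
may also define the item level MAE error for the target domain
with respect to $\tau$ as:

(3.4) $\epsilon^d_i = \ell_1(X^d_{*i}, \hat{X}^d_{*i})$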
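The tolerance-based item loss described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `item_level_loss` is hypothetical, unobserved ratings are assumed to be stored as 0 (matching the definition of observed links as non-zero entries), and the boundary case is assumed to count as within tolerance.

```python
import numpy as np


def item_level_loss(true_col, pred_col, tau):
    """Return -1 if an item's summed absolute error is within tolerance, else +1.

    true_col: ratings of all users for one item, 0 meaning "not observed" (assumption)
    pred_col: predicted ratings for the same users
    tau:      tolerated average absolute error per observed rating
    """
    true_col = np.asarray(true_col, dtype=float)
    pred_col = np.asarray(pred_col, dtype=float)
    observed = true_col != 0                      # mask of observed ratings
    nnz = int(observed.sum())                     # nnz(X_{*i})
    total_abs_err = np.abs(pred_col[observed] - true_col[observed]).sum()
    # Within tolerance -> -1, otherwise the item counts as mis-predicted -> +1
    return -1 if total_abs_err <= tau * nnz else 1
```

For example, with true ratings `[4, 0, 2]` and predictions `[4.5, 3.0, 2.5]`, only two ratings are observed and the summed absolute error is 1.0, so the item is within tolerance for `tau = 0.6` but not for `tau = 0.4`.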



To facilitate the optimization, we consider the following
exponential loss for empirical risk minimization:

(3.5) $\ell_2(i) = e^{\ell_1(X^d_{*i}, \hat{X}^d_{*i})}$

As stated in the previous section, a lower variance of empirical
errors provides a more confident consistency estimate, so
we combine these factors and reformulate the loss function:

(3.6) $\mathcal{L} = \Big(\sum_i \ell_2(i)\Big)^2 + \gamma \sum_{i>j} \big(\ell_2(i) - \ell_2(j)\big)^2$

Above all, the model minimizes the above quantity for some
scalar $\gamma > 0$.

Assume that the function of interest $H$ for prediction is
composed of the hypotheses from each weak learner. The
function to be output consists of the following additive
model over the hypotheses from the weak learners:

(3.7) $H(X^d) = \sum_{t=1}^{T} \alpha^t h^t(X^d)$

where $\alpha^t \in \mathbb{R}^+$.

Since we are interested in building an additive model,
we assume that we already have a function $h(\cdot)$. Subsequently,
we derive a greedy algorithm to obtain a weak
learner $G^t(\cdot)$ and a positive scalar $\alpha^t$ such that
$f(\cdot) = h(\cdot) + \alpha^t G^t(\cdot)$.

In the following derivation, for convenience of presentation,
we omit the model index $t$, and use $G$ to represent $G^t$ and
$\alpha$ to represent $\alpha^t$.

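By defining $\gamma_1 = (1 + (n - 1)\gamma)$, $\gamma_2 = (2 - 2\gamma)$,
$w^d_i = e^{\ell_1(h(X^d_{*i}), X^d_{*i})}$ and $G^d_i = \ell_1(G(X^d_{*i}), X^d_{*i})$, Eq.(3.6)
can be equivalently posed as optimizing the following loss
with respect to $\alpha$:

(3.8) $\gamma_1 \Big( \sum_{i \in I} (w^d_i)^2 e^{2\alpha} + \sum_{i \in J} (w^d_i)^2 e^{-2\alpha} \Big) + \gamma_2 \sum_{i>j:\, i,j \in I} w^d_i w^d_j e^{2\alpha} + \gamma_2 \sum_{i>j:\, i,j \in J} w^d_i w^d_j e^{-2\alpha} + \gamma_2 \sum_{i>j:\, i \in I, j \in J \text{ or } i \in J, j \in I} w^d_i w^d_j$

For brevity, we define the following sets of indices as $I =
\{i : G^d_i = +1\}$ and $J = \{i : G^d_i = -1\}$. Here $J$ denotes
the set of items whose prediction by $G(\cdot)$ falls into the fault
tolerable range, while $I$ denotes the remaining items. By setting the
derivative of Eq.(3.8) with respect to $\alpha$ equal to zero, we get:

(3.9) $\alpha = \frac{1}{4} \log \frac{\gamma_1 \sum_{i \in J} (w^d_i)^2 + \gamma_2 \sum_{i>j:\, i,j \in J} w^d_i w^d_j}{\gamma_1 \sum_{i \in I} (w^d_i)^2 + \gamma_2 \sum_{i>j:\, i,j \in I} w^d_i w^d_j}$

If we set $\gamma = 0$, then it is reduced to the form of AdaBoost:

(3.10) $\alpha = \frac{1}{4} \log \frac{\big(\sum_{i \in J} w^d_i\big)^2}{\big(\sum_{i \in I} w^d_i\big)^2}$

Finally, the updating rule for $w^d_i$ is

(3.11) $w^d_i \leftarrow w^d_i \, e^{\alpha G^d_i}$

For the instance weight $w^s_i$ in the source domain, we can
adopt an updating rule similar to Eq.(3.11).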
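The model weight and the target weight update can be illustrated numerically. The sketch below is not the authors' implementation; it assumes the sign convention stated in the text (label +1 marks a mis-predicted item, label -1 an item within the tolerable range), assumes that both groups are non-empty, and uses the definitions $\gamma_1 = 1 + (n-1)\gamma$ and $\gamma_2 = 2 - 2\gamma$. The function names are hypothetical.

```python
import numpy as np


def pair_sum(w):
    """Sum of w_i * w_j over all unordered pairs i > j."""
    w = np.asarray(w, dtype=float)
    return 0.5 * (w.sum() ** 2 - np.sum(w ** 2))


def model_weight(w, g, gamma):
    """Model weight alpha for item weights w and labels g in {-1, +1}.

    Assumption: g = +1 marks mis-predicted items, g = -1 items within tolerance,
    and both groups contain at least one item.
    """
    w = np.asarray(w, dtype=float)
    g = np.asarray(g)
    n = w.size
    gamma1 = 1.0 + (n - 1) * gamma
    gamma2 = 2.0 - 2.0 * gamma
    ok, bad = w[g == -1], w[g == 1]
    num = gamma1 * np.sum(ok ** 2) + gamma2 * pair_sum(ok)
    den = gamma1 * np.sum(bad ** 2) + gamma2 * pair_sum(bad)
    return 0.25 * np.log(num / den)


def update_weights(w, g, alpha):
    """Multiplicative update: mis-predicted items (g = +1) gain weight when alpha > 0."""
    return np.asarray(w, dtype=float) * np.exp(alpha * np.asarray(g))
```

With `gamma = 0` the weight reduces to half the log-ratio of the total weight of well-predicted items to that of mis-predicted items, which is the familiar AdaBoost form. A positive `gamma` additionally penalizes a large spread of the per-item losses.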

Other than the instance level selection discussed above,
we also want to perform domain level selection to
penalize those domains that are likely to be irrelevant, so
that the domains with more relevant instances carry more weight.
Following the idea of task-based boosting [3], we further
introduce a re-weighting factor $\beta$ for each source domain
to control the knowledge transfer. So we formulate the
updating rule for $w^s_i$ to be:

(3.12) $w^s_i \leftarrow w^s_i \, e^{\ldots}$

where $\beta$ can be set greedily in proportion to the performance
gain of the single source domain transfer learning:

(3.13) $\beta = \frac{\sum_i w^d_i (\tilde{\epsilon}_i - \epsilon_i)}{\|w^d\|_1}$

where $\epsilon_i$ is the training error of the transfer learning model,
and $\tilde{\epsilon}_i$ is the training error of the non-transfer learning model,
which utilizes only the observed target domain data.
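The domain level fitness weight can be sketched as a weighted average of per-item performance gains. This is an illustration under stated assumptions, not the authors' code: the function name is hypothetical, and the gain is taken as the non-transfer error minus the transfer error, so that a source domain that helps receives a positive weight.

```python
import numpy as np


def domain_fitness_weight(w_target, err_transfer, err_no_transfer):
    """Weighted average gain of transfer over non-transfer on the target items.

    w_target:        target item weights w^d
    err_transfer:    per-item training error of the model that uses this source domain
    err_no_transfer: per-item training error of the target-only model
    Assumption: gain = err_no_transfer - err_transfer, normalized by the L1 norm of w^d.
    """
    w = np.asarray(w_target, dtype=float)
    gain = np.asarray(err_no_transfer, dtype=float) - np.asarray(err_transfer, dtype=float)
    return float(np.sum(w * gain) / np.sum(np.abs(w)))
```

Under this convention a source domain whose use increases the training error on the target items gets a negative value, which matches the stated aim of penalizing domains that are likely to be irrelevant.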



Table 1: Datasets in our experiments.

Notation   Data Set        Data Type      Instances No.
D1         Douban Music    Rating [1,5]   1.2 x 10^
D2         Douban Book     Rating [1,5]   5.8 x 10^
D3         Douban Movie    Rating [1,5]   1.4 x 10^
D4         Netflix         Rating [1,5]   1.8 x 10^
D5         Wikipedia       Editing Log    1.1 x 10^
D6         IMDB            Hyperlink      5.0 x 10^



4 Experiments 

4.1 Data Sets and Experimental Settings We evaluate
the proposed method on four data sources: Netflix, Douban,
IMDB and Wikipedia user editing records. The Netflix
rating data contains more than 100 million ratings with
values in {1, 2, 3, 4, 5}, which are given by more than 4.8 x
10^ users on around 1.8 x 10^ movies. Douban contains
movie, book and music recommendations, with rating values
also in {1, 2, 3, 4, 5}. The IMDB hyperlink graph is employed
as a measure of similarity between movies. In the graph,
each movie links to its 10 most similar movies. The
Wikipedia user editing records provide a {0, 1} indicator of
whether or not a user is concerned with a certain movie.

The data sets used in the experiments are described as
follows. For Netflix, to retain the original features of the
users while keeping the size of the data set suitable for
the experiments, we sampled a subset of 10,000 users. In the
Douban data sets, we obtained 1.2 x 10^ ratings on 7.5 x 10^
music items, 5.8 x 10^ ratings on 3.5 x 10^ books, and 1.4 x 10^
ratings on 8 x 10^ movies, given by 1.1 x 10^ users. For both
the IMDB data set and the Wikipedia data set, we filtered
them by matching the movie titles in both the Netflix and the
Douban Movie data sets. After pre-processing, the IMDB
hyperlink data set contains ~ 5 x 10^ movies. The Wikipedia
user editing records data set has 1.1 x 10^ editing logs by
8.5 x 10^ users on the same ~ 5 x 10^ movies as the IMDB
data set. To present our experiments, we use the shorthand
notations listed in Table 1 to denote the data sets.

We evaluate the proposed algorithm on five cross-domain
recommendation tasks, as follows:

• The first task simulates cross-domain collaborative
filtering using the Netflix data set. The sampled
data is partitioned into two parts with disjoint sets
of movies but an identical set of users. One part consists
of ratings on 8,000 movies with 1.6% density,
which serves as the source domain. The remaining
7,000 movies are used as the target domain with different
levels of density.

• The second task is a real-world cross-domain recommendation,
where the source domain is Douban Book
and the target domain is Douban Movie. In this setting,
the source and the target domains share the same user
set but have different item sets.

• The third task is constructed with Netflix and Douban
data. There are about 6,000 shared movies in Netflix
and Douban Movie. We extract the ratings on shared
movies from Netflix and Douban Movie. We then get
4.9 x 10^ ratings from Douban given by 1.2 x 10^ users
with density 0.7%, and 10^ ratings from Netflix given
by 10^ users with density 1.7%. The goal is to transfer
knowledge from the Netflix data set to Douban Movie.
In this task, the item set is identical across domains but the user
sets are totally different.

• The fourth task evaluates the effectiveness of the
proposed algorithm in the context of multiple source
domains. It uses both Douban Music and Douban
Book as the source domains and transfers knowledge to
the Douban Movie domain.

• The fifth task varies the type of source domains. It
utilizes the Wikipedia user editing records and the IMDB
hyperlink graph, together with Douban Movie, as the
source domains to perform rating predictions on the
Netflix movie data set.

Netflix: http://www.netflix.com
IMDB: http://www.imdb.com
Wikipedia: http://en.wikipedia.org

For evaluation, we calculate the Root Mean Square
Error (RMSE) on the held-out ~30% of the target data:

$RMSE = \sqrt{\sum_{(u,i,x_{ui}) \in T_E} (x_{ui} - \hat{x}_{ui})^2 / |T_E|}$

where $x_{ui}$ and $\hat{x}_{ui}$ are the true and predicted ratings,
respectively, and $|T_E|$ is the number of test ratings.
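The evaluation metric can be computed as in the following minimal sketch. The function name `rmse` and the use of NumPy arrays for the held-out true and predicted ratings are assumptions made for illustration.

```python
import numpy as np


def rmse(true_ratings, predicted_ratings):
    """Root mean square error over the held-out test ratings."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    predicted_ratings = np.asarray(predicted_ratings, dtype=float)
    # Mean of squared differences over the |T_E| test ratings, then square root
    return float(np.sqrt(np.mean((true_ratings - predicted_ratings) ** 2)))
```

Only the held-out ratings should be passed in; unobserved entries of the rating matrix are not part of the test set.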

4.2 STLCF and Baseline Methods We implement two
variations of our STLCF method. STLCF(E) is an STLCF
method that only takes training error into consideration when
performing selective transfer learning. STLCF(EV) not only
considers training error, but also utilizes the empirical error
variance. To demonstrate the significance of our STLCF,
we selected the following baselines (parameters for these
baseline models are fine-tuned via cross validation). PMF [20] is a
recently proposed method for missing value prediction. Previous
work showed that this method works well on large,
sparse and imbalanced data sets. GPLSA [13] is a classical
non-transfer recommendation algorithm. CMF [24] is proposed
for jointly factorizing two matrices. Adopted as
a transfer learning technique in several recent works, CMF
has proven to be an effective cross-domain recommendation
approach. TGPLSA is a uniformly weighted transfer
learning model, which utilizes all source data to help build
the target domain model. It is used as one of the baselines
because we adopt it as the base model of our boosting-based
selective transfer learning framework.



Table 2: Prediction performance of STLCF and the baselines.

                                Source      Target      Non-TL           Non-Selective TL  Selective TL
Datasets                        sparseness  sparseness  GPLSA   PMF      TGPLSA  CMF       STLCF(E)  STLCF(EV)
D4(Simulated) to D4(Simulated)  1.6%        0.1%        1.0012  0.9993   0.9652  0.9688    0.9596    0.9533
                                            0.2%        0.9839  0.9814   0.9528  0.9532    0.9468    0.9347
                                            0.3%        0.9769  0.9728   0.9475  0.9464    0.9306    0.9213
D2 to D3                        1.5%        0.1%        0.8939  0.8856   0.8098  0.8329    0.7711    0.7568
                                            0.2%        0.8370  0.8323   0.7462  0.7853    0.7353    0.7150
                                            0.3%        0.7314  0.7267   0.7004  0.7179    0.6978    0.6859
D4 to D3                        1.7%        0.1%        0.8939  0.8856   0.8145  0.8297    0.7623    0.7549
                                            0.2%        0.8370  0.8323   0.7519  0.7588    0.7307    0.7193
                                            0.3%        0.7314  0.7267   0.7127  0.7259    0.6982    0.6870



4.3 Experimental Results 

4.3.1 Performance Comparisons We test the performance
of our STLCF methods against the baselines. The
results of the collaborative filtering tasks under three different
levels of target domain sparseness are shown in Table 2.

First, we observe that the non-transfer methods, i.e.
GPLSA and PMF, fail to give accurate predictions, especially
when the target domain is severely sparse. With the
help of source domains, the (non-selective) transfer learning
methods with equal weights on the source domains,
like TGPLSA and CMF, can increase the accuracy of the rating
predictions. Our selective transfer learning methods
(i.e., STLCF(E) and STLCF(EV)) do even better. The
fact that our STLCF outperforms the others is expected because,
by performing selective knowledge transfer, we use only the
truly helpful source domain data, which is how the framework
is designed to handle the sparseness issue in CF problems.

Second, comparing the two non-selective TLCF methods
with the two selective TLCF methods, we observe
that on the two real-world tasks (D2 to D3 and D4 to D3),
when the target domain is extremely sparse (say 0.1%), the
improvement in accuracy achieved by our STLCF methods
over the non-selective transfer learning methods is much
larger than on the simulated data set based
on Netflix (D4 to D4). For example, at 0.1% target sparseness the
RMSE drops from 0.8098 (TGPLSA) to 0.7568 (STLCF(EV)) on D2 to D3,
but only from 0.9652 to 0.9533 on the simulated task. Notice that
the inconsistency between the
target domain and the source domains on the simulated data
sets is much smaller than in the real-world cases. The
experimental results show that our STLCF algorithm is effective
in handling the inconsistency between the sparse target
domain and the source domains.

Third, we notice that some factors, like empirical error
variance, may affect the prediction. In Table 2, we compare
our two STLCF methods, i.e., STLCF(E) and STLCF(EV),
when the target domain sparsity is 0.1%. We find
that on the task "D2 to D3", i.e., Douban Book to Movie,
STLCF(EV) is much better than STLCF(E). But on the
task "D4(Simulated) to D4(Simulated)", the improvement of
STLCF(EV) over STLCF(E) is less significant. These
observations may be due to the domain consistency. For



Table 3: Prediction performance of STLCF for Long-Tail Users on the D2
to D3 task. STLCF(E) does not punish the large variance of empirical error,
while STLCF(EV) does.

Ratings    Non-TL   Non-Selective TL   Selective TL, i.e. STLCF
per user   GPLSA    TGPLSA   CMF       (E)      (EV)
1-5        1.1942   0.9294   0.9312    0.8307   0.8216
6-10       0.9300   0.7859   0.7929    0.7454   0.7428
11-15      0.8296   0.7331   0.7390    0.7143   0.7150
16-20      0.7841   0.7079   0.7113    0.7042   0.7050
21-25      0.7618   0.6941   0.6947    0.6942   0.6910
26-30      0.7494   0.6918   0.6884    0.6917   0.6852
31-35      0.7370   0.6909   0.6911    0.6915   0.6818
36-40      0.7281   0.6896   0.6856    0.6907   0.6776
41-45      0.7219   0.6878   0.6821    0.6890   0.6740
46-50      0.7187   0.6881   0.6878    0.6800   0.6734

the task "D4(Simulated) to D4(Simulated)", both the source
and target entities are movie ratings from the Netflix data set,
while the task "D2 to D3" tries to transfer knowledge
from a book recommendation system to a movie recommendation
system, which may contain some domain-specific
items. When the target domain is very sparse, i.e. the
users' ratings on the items are rare, there is a chance of occasionally
getting high prediction accuracy on the observed data
with a bad model built on source domains that are inconsistent
with the target domain. In this case, it is important to consider
the variance of the empirical error as well. Compared with
STLCF(E), STLCF(EV), which punishes large variance,
can better handle the domain inconsistency in transfer learning,
especially when the target domain is sparse.

4.3.2 Results on Long-Tail Users To better understand
the impact of STLCF with the help of the source domain, we
conduct a fine-grained analysis of the performance improvement
on the Douban data sets, with Douban Book as the source domain
and Douban Movie as the target domain. The results on
different user groups in the target domain are shown in Table 3.
First, we observe that the STLCF models, i.e., STLCF(E)
and STLCF(EV), achieve significantly better results on
those long-tail users who have very few ratings in historical
logs (for users with 1-5 ratings, the RMSE falls from 0.9294 with
TGPLSA to 0.8216 with STLCF(EV)). This implies that our STLCF
methods can handle the long-tail users, who really need a
fine-grained analysis when performing knowledge transfer from
source domains. Current CF models without any fine-grained
analysis of specific users usually fail to capture the preferences
of the long-tail users, while our STLCF methods work
well because they can selectively augment the weight of the
corresponding source domain instances with respect to those
long-tail cases at both the instance level and the domain level.
Second, STLCF(EV) works better than STLCF(E) on the
non-long-tail users, i.e., those with more than 25 ratings per user in
the historical log. This is expected because users with more
ratings can benefit more from the error variance analysis to
avoid negative knowledge transfer.

Table 4: Prediction performance of STLCF with multiple source domains containing much irrelevant information.

Target (D4)
sparseness   Source Domain:  None     D3       D3 & D5  D3 & D6  D5 & D6  D3 & D5 & D6
0.1%                         0.9983   0.9789   0.9747   0.9712   0.9923   0.9663
0.2%                         0.9812   0.9625   0.9583   0.9572   0.9695   0.9505
0.3%                         0.9703   0.9511   0.9409   0.9464   0.9599   0.9383

Table 5: Prediction performance of STLCF with multiple source domains
(Douban).

Target (D3)
sparseness   Source Domain:  None     D1       D2       D1 & D2
0.1%                         0.8856   0.7521   0.7568   0.7304
0.2%                         0.8323   0.7163   0.7150   0.6904
0.3%                         0.7267   0.6870   0.6859   0.6739

4.3.3 STLCF with Multiple Source Domains We apply
STLCF(EV) on the extremely sparse target movie domain,
with two sets of source domains: one is composed of
Douban Music and Douban Book, the other is composed
of Douban Movie, the IMDB hyperlink graph and the Wikipedia
user editing records. The results are in Table 5 and Table 4
respectively. We demonstrate that our STLCF method can utilize
multiple source domains of various types by handling the
inconsistency between the target and the source domains.

First, for the Douban experiments shown in Table 5, we
observe that, compared with using only Douban Book
or Douban Music as the source domain, there are significant
improvements when both of them are used. The result is
expected because each of the source domains has its own
share of effective information for the target domain. For
example, a user who shows much interest in the movie "The
Lord of the Rings" may have consistent preferences for the
novel. In this case, with the help of more auxiliary sources,
better results are expected.

Second, we explore the generalization of the choices of 
source domains by introducing domains like Wikipedia user 
editing records and IMDB hyperlink graph, which are not 
directly related to the target domain but still contain some 
useful information in helping the target task (Netflix rating
prediction). The results are shown in Table 4. Comparing
the results of the experiment that uses no source domain 
(non-transfer) to those that use source domains D5 & D6, 
we observe that although the Wikipedia user editing records 
or IMDB hyperlink graph is not closely related to the target 
domain and can hardly be adopted as source domains by 
previous transfer learning techniques, our STLCF method 
can still transfer useful knowledge successfully. In addition,
comparing the results of the experiment that uses the single
source domain D3 to those that use source domains D3
& D5, D3 & D6, or D3 & D5 & D6, we find that the
Wikipedia user editing records or the IMDB hyperlink graph
can provide some useful knowledge that is not covered by
the related movie source domains. Despite the noise and the
heterogeneous setting, our STLCF method can still utilize
these source domains to help the target domain tasks. As
discussed in Section 3, our STLCF performs selective
transfer learning at both the domain level and the instance
level. On the one hand, the domain-level selective transfer
can block noisy information globally. D5 & D6 are
noisy and thus contain much data that are inconsistent
with the observed data in the target domain, so the
overall transfer from D5 & D6 is penalized. On the other
hand, the instance-level selective transfer can further
eliminate the effects of irrelevant source instances.

Overall, our STLCF adapts well to source domains
that are relatively inconsistent with the target
domain, even when the target domain is rather sparse.
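
To make the two levels of selection concrete, the sketch below combines a domain-level weight with an instance-level weight. It is an illustration only and not the update rule of STLCF: the exponential forms, the score "mean error plus a weighted standard deviation", the function names, and the assumption that per-instance errors are normalized to [0, 1] are all choices made here for the example. The parameters `tau` and `gamma` merely play the roles of an error threshold and a variance weight.

```python
import math


def domain_score(errors, gamma=0.5):
    """Assumed consistency score for one source domain (lower is better):
    mean error plus gamma times the standard deviation of the errors.
    `errors` must be a non-empty list of values in [0, 1]."""
    n = len(errors)
    mean = sum(errors) / n
    variance = sum((e - mean) ** 2 for e in errors) / n
    return mean + gamma * math.sqrt(variance)


def selective_weights(domain_errors, tau=0.03, gamma=0.5):
    """Return a weight for every source instance.

    domain_errors: dict mapping domain name -> list of per-instance errors.
    Domain level:   the whole domain is scaled by exp(-score).
    Instance level: instances whose error exceeds tau are scaled down further.
    """
    weights = {}
    for name, errors in domain_errors.items():
        domain_w = math.exp(-domain_score(errors, gamma))
        weights[name] = [
            domain_w * (1.0 if e <= tau else math.exp(-e)) for e in errors
        ]
    return weights
```

Under these assumptions, a noisy domain is penalized as a whole, and within any domain the instances with large errors are penalized again, which mirrors the qualitative behavior described above.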

4.3.4 Parameters Analysis of STLCF There are two
parameters in our STLCF, i.e., the prediction error threshold τ
and the empirical error variance weight γ. Since τ and γ are
independent, we fix one and adjust the other.

We fix the empirical error variance weight at γ = 0.5
and adjust the parameter τ. Based on the results shown in
Figure 2, the model performs well when τ is of order
10^{-2}. We also tune the parameter γ, which balances the
empirical error and its variance. We fix the prediction error
threshold at τ = 0.03 when tuning γ. As shown in Figure 3,
when we vary γ from 0 to 1, the best choices
of γ are found to be around 0.4 to 0.5.
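
The tuning procedure described above (fix one parameter, vary the other) can be sketched as follows. This is a hedged illustration: `evaluate` is a hypothetical callable that trains a model with the given error threshold (`tau`) and variance weight (`gamma`) and returns a validation RMSE. It is not part of any published STLCF code.

```python
def sweep_one_at_a_time(evaluate, taus, gammas, fixed_tau=0.03, fixed_gamma=0.5):
    """Tune two independent parameters by varying one while the other is fixed.

    evaluate: hypothetical callable, evaluate(tau=..., gamma=...) -> RMSE.
    taus, gammas: candidate values to try for each parameter.
    Returns the best value of each parameter and both RMSE curves.
    """
    # Vary the error threshold with the variance weight held fixed.
    tau_curve = {t: evaluate(tau=t, gamma=fixed_gamma) for t in taus}
    # Vary the variance weight with the error threshold held fixed.
    gamma_curve = {g: evaluate(tau=fixed_tau, gamma=g) for g in gammas}
    best_tau = min(tau_curve, key=tau_curve.get)
    best_gamma = min(gamma_curve, key=gamma_curve.get)
    return best_tau, best_gamma, tau_curve, gamma_curve
```

The one-at-a-time design is only justified when the parameters do not interact, which is the independence assumption stated in the text. Otherwise a joint grid over both parameters would be the safer choice.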

Figure 2: Change of the RMSEs with different τs.

Figure 3: Change of the RMSEs with different γs.

Table 6: Overview of STLCF in a big picture of collaborative filtering

                        Selective   Non-Selective
Transfer Learning       STLCF       RMGM [11], CMF [24], TIF [15], etc.
Non-Transfer Learning   -           MMMF [19], GPLSA [7], PMF [20], etc.

4.3.5 Convergence and Overfitting Test Figure 4 shows
the RMSEs of STLCF(EV) as the number of weak learners
changes on the Douban Book to Movie task. From the figure
on the left, we observe that STLCF(EV) converges well after
40 iterations. The corresponding α also
converges to around 0.68 after 40 iterations.
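
A convergence point such as "after 40 iterations" can be read off a recorded RMSE (or weight) trace with a simple stability check. The sketch below is one possible criterion, assumed here for illustration: the trace is considered converged at the first iteration from which `window` consecutive values stay within `tol` of each other. The function name and both parameters are hypothetical.

```python
def converged_at(values, window=5, tol=1e-3):
    """Return the first index i such that values[i:i + window] spans less
    than tol, or None if no such stretch exists in the trace."""
    for i in range(len(values) - window + 1):
        chunk = values[i:i + window]
        if max(chunk) - min(chunk) < tol:
            return i
    return None
```

The same check can be applied to the per-iteration RMSEs and to the per-iteration learner weights, since the text reports that both settle at about the same point.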

The number of latent topics of the base learner TGPLSA
reflects the model's ability to fit training data. When we 
keep increasing the number of latent topics, the model tends 
to better fit the training data. But if the number of latent 
topics is too large, the model may suffer from overfitting. We 
investigate the overfitting issue by plotting the training and 
testing RMSEs of the non-transfer learning model GPLSA, 
the non-selective transfer learning model TGPLSA and our 
selective transfer learning model STLCF(EV) over different 
numbers of latent topics in Figure 5. The data sparsity for
the target domain is around 0.3%. 

We observe that, compared with our STLCF, the
training RMSEs of GPLSA and TGPLSA decrease faster,
while their testing RMSEs go down more slowly. When the
number of latent topics k is about 50, the testing RMSEs of
GPLSA start to go up. For TGPLSA, the testing RMSEs also
go up slightly when k is larger than 75. In contrast, the
testing RMSEs of our STLCF keep decreasing until k = 125,
and even when k is larger than 125 the rise in our STLCF's
testing RMSEs is not obvious. Clearly, when the target domain
is very sparse, our STLCF method is more robust against
overfitting, inheriting the advantage of boosting techniques
and the fine-grained selection on knowledge transfer.
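
The overfitting comparison above amounts to locating, for each model, the number of latent topics at which the testing RMSE is lowest and measuring how much it rises afterwards. A minimal sketch follows. The function name and the returned fields are assumptions for illustration, and no values from the paper's figures are reproduced here.

```python
def overfitting_summary(ks, train_rmse, test_rmse):
    """Summarize overfitting from RMSE curves over model sizes.

    ks:         list of latent-topic counts, in increasing order.
    train_rmse: training RMSE for each entry of ks.
    test_rmse:  testing RMSE for each entry of ks.
    """
    # Index of the model size with the lowest testing RMSE.
    best = min(range(len(ks)), key=lambda i: test_rmse[i])
    return {
        "best_k": ks[best],
        # Train/test gap at the best model size.
        "gap_at_best": test_rmse[best] - train_rmse[best],
        # Largest increase in testing RMSE beyond the best model size.
        "rise_after_best": max(test_rmse[best:]) - test_rmse[best],
    }
```

A larger `best_k` together with a small `rise_after_best` corresponds to the robustness against overfitting that the text attributes to STLCF.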

5 Related Works 

The proposed Selective Transfer Learning for Collaborative
Filtering (STLCF) algorithm is most related to the works
in collaborative filtering. In Table 6, we summarize the
related works under the collaborative filtering context. To
the best of our knowledge, no previous work for transfer
learning on collaborative filtering has ever focused on the
fine-grained analysis of consistency between source domains
and the target domain, i.e., the selective transfer learning.

Figure 4: Change of the RMSEs and αs when more and more weak

Figure 5: Change of the RMSEs with different numbers of latent topics.

Collaborative Filtering as an intelligent component
in recommender systems has gained extensive interest in
both academia and industry. Various models have been
proposed, including factorization models [10, 15, 16, 18, 19],
probabilistic mixture models [ S^^SJ, Bayesian networks ifTTIl
and restricted Boltzmann machines [21]. However, most of
the previous works suffer from overfitting to the small
set of observed data. In this paper, we introduce the concept
of selective transfer learning to better tackle the overfitting
and data sparseness issues.

Transfer Learning [14] utilizes data sets from related
but different domains to build a model for a target application
domain. A few works on transfer learning are in the context
of collaborative filtering. Mehta and Hofmann [13] consider
the scenario involving two systems with shared users and use
manifold alignment methods to jointly build neighborhood
models for the two systems. They focus on making use of an
auxiliary recommender system when only part of the users
are aligned, which does not distinguish the consistency of
users' preferences among the aligned users. Li et al. [12]
designed a regularization framework to transfer knowledge
of cluster-level rating patterns, which does not make use of
the correspondence between source and target domains.

Recently, researchers proposed MultiSourceTrAdaBoost
[26] to automatically select the appropriate
data for knowledge transfer from multiple sources. The
more recent TransferBoost [3] was proposed to iteratively
construct an ensemble of classifiers by re-weighting source
and target instances via both individual and task-based
boosting. Moreover, EBBoost [23] suggests weighting the
instances based on the empirical error as well as its variance.
However, so far these works are limited to classification
tasks. Our work is the first to systematically study selective
knowledge transfer in the settings of collaborative filtering.
Besides, we propose a novel factor, the variance of the
empirical error, that is shown to be of much help in solving
real-world CF problems.

6 Conclusions 

In this paper, we proposed to perform selective knowledge
transfer for CF problems and presented a systematic
study on how factors such as the variance of the empirical
error influence the selection. We found that although the
empirical error is effective in modeling the consistency across
domains, it suffers from the sparseness problem in
CF settings. By introducing a novel factor, the variance of the
empirical error, to measure how trustworthy this consistency is,
the proposed criterion can better identify the useful source
domains and the helpful proportions of each source domain.
We embedded this criterion into a boosting framework to
transfer the most useful information from the source domains
to the target domain. The experimental results on real-world
data sets showed that our selective transfer learning
solution performs significantly better than several
state-of-the-art methods at various sparsity levels. Furthermore,
compared with existing methods, our solution works well on
long-tail users and is more robust to overfitting.